Optical Character Recognition of Arabic Text

Part 1 – getting ready for programmatic OCR

Thomas Hegghammer

thomas.hegghammer@all-souls.ox.ac.uk

15 September 2025

Overview

Today:

  • Part 1: Getting ready for programmatic OCR
  • Part 2: OCR on single-column text

Tomorrow:

  • Part 3: OCR on complex layouts

Schedule today

  • 09.30: Introductions and troubleshooting
  • 10.00: The OCR problem
  • 10.30: Programmatic OCR
  • 11.00: Break
  • 11.20: Programming essentials
  • 12.30: Overview of OCR tools and strategies
  • 13.30: Lunch
  • 14.30: OCR with local tools
  • 15.15: OCR with APIs
  • 16.00: Break
  • 16.15: Postprocessing with LLMs
  • 17.00: End

2. The OCR problem

https://www.youtube.com/watch?v=-VEJUzDFAZA

A tough nut to crack

  • One of the oldest and hardest problems in computer science

  • Major progress from late 2010s onwards thanks to convolutional neural networks

  • Mostly solved for English, but not for Arabic

Why so difficult?

  • Complex computer vision task
    • Extreme input variability (fonts, layout, degradation, languages)
    • Many ambiguities (e.g., O and 0, l and 1)
  • Multimodal problem: Involves combining visual pattern recognition, linguistic knowledge, and world knowledge
    • Humans are incredibly good at this
    • Only with AI/LLMs are we starting to get similar abilities in computers

Why is Arabic OCR not solved?

  • Less effort has gone into it
    • English has always been the priority in industry
    • Other languages in Latin script could piggyback on English
    • Less development, less training data for non-Latin scripts
  • Script peculiarities
    • Right to left, but often with Latin script mixed in
    • Different positional forms and connections
    • Diacritics
  • Currently among hardest languages to OCR
    • Chinese OCR performs better
    • Among big languages, only Urdu OCR performs worse

GUI options

Adobe Acrobat has no Arabic

Same with most online services

  • Adobe Acrobat Online
  • onlineocr.net

Those that have it perform poorly

E.g. convertio.io

Similarly with ABBYY Finereader

Plus: Windows only, and $16+/month

Best GUI solution for now: Google Docs

  1. Upload to Google Drive
  2. Right-click, choose “Open with”, then “Google Docs”

But important limitations

  • Does not scale to many documents
  • No adaptability or modularity

3. Programmatic OCR

Computer basics

  • Computers can be controlled with tools at different levels of abstraction
    • Binary code
    • Low-level languages: Assembly code, C
    • High-level languages: Python, JavaScript, R, and shell languages such as Bash and PowerShell
    • Graphical user interfaces
  • All computers have a command line (aka terminal)
    • “Engine room” of computer
    • Language of command line differs by operating system (OS): a Unix shell such as Bash for macOS/Linux, PowerShell for Windows
  • Programming languages like Python and R are OS-independent ways of controlling a computer

Filesystem

  • All computers have a filesystem: a tree-structured hierarchy of folders and files
  • All folders and files have an “address”, called filepath
  • Path notation slightly different on Windows vs MacOS/Linux
    • Unix: /home/thomas/Pictures/photo.jpg
    • Windows: C:\Users\Thomas\Pictures\photo.jpg
  • Filepaths can be absolute (in full form) or relative (abbreviated, relative to your location)
  • Notion of “location” is important: When coding, you are always launching commands from some place in the filesystem
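The path concepts above can be explored directly in Python with the standard library's pathlib (the example paths are the Unix ones from the slide):

```python
from pathlib import Path

# Your current "location" in the filesystem (the working directory)
print(Path.cwd())

# An absolute path spells out the full route from the filesystem root
abs_path = Path("/home/thomas/Pictures/photo.jpg")
print(abs_path.is_absolute())   # True on Unix-like systems

# A relative path is interpreted relative to your current location
rel_path = Path("Pictures/photo.jpg")
print(rel_path.is_absolute())   # False

# Joining the working directory to a relative path yields an absolute one
print(Path.cwd() / rel_path)
```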

Programmatic tool types (1): Command line interface (CLI) programs

  • Programs that are launched directly in the terminal
  • Examples: Tesseract, Kraken, Surya, ImageMagick
  • Typically written in C or Python
  • Syntax is largely the same across operating systems
  • Many programs exist both in CLI form and as Python/R packages

Programmatic tool types (2): Python or R packages

  • Collections of functions for specialized tasks
  • Imported and used inside Python or R
  • Examples: pandas in Python or stringr in R
  • Names can vary: e.g. Tesseract implemented as pytesseract in Python and tesseract in R

Programmatic tool types (3): Application programming interfaces (APIs)

  • Communication channel to a remote service
  • E.g. Google Document AI API, OpenAI API
  • APIs use the HTTP protocol and are therefore programming-language-agnostic
  • APIs can be called from command line with curl
  • But easier to use in Python or R with tailored packages (e.g. google-cloud-documentai, daiR)
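Because APIs are just HTTP, any language that can send an HTTP request can call them. Here is a minimal sketch with Python's standard library, using a purely hypothetical endpoint and key; every real API defines its own URL, authentication scheme, and payload format:

```python
import json
import urllib.request

# Hypothetical endpoint and payload, for illustration only
payload = json.dumps({"image": "<base64-encoded page>"}).encode()
req = urllib.request.Request(
    "https://api.example.com/v1/ocr",       # made-up URL
    data=payload,
    headers={
        "Authorization": "Bearer YOUR_API_KEY",  # placeholder key
        "Content-Type": "application/json",
    },
    method="POST",
)

# urllib.request.urlopen(req) would actually send the request;
# the response is typically JSON containing the recognized text.
print(req.get_method(), req.full_url)
```

Tailored packages such as google-cloud-documentai or daiR wrap exactly this kind of request behind friendlier functions.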

Programmatic tool types (4): Local models

  • Big files containing weights from AI training, open sourced for public use
  • Specialized for some task, e.g. text/image generation, text/image classification
  • Used from Python
  • In OCR context, we are typically interested in visual large language models, layout parsing models, and large language models (for postprocessing)

GitHub Codespaces - orientation

Git and GitHub

  • Git: a version control system
  • GitHub: cloud storage for Git repositories
  • Repository: a folder with contents specific to a single project (article, course, app)
  • Cloning: downloading a repository from GitHub
  1. Go to webpage of repository

  2. Locate the green “Code” button and click it

  3. Copy the clone URL

  4. Navigate to what you want to be the parent folder of the cloned repository

  5. Clone

git clone CLONE_URL
# To get the example files etc:
git clone https://github.com/Hegghammer/ocr-cairo

To pull updates from a repository you have already cloned, simply run:

# Make sure you are in the root of the repository
git pull

4. Programming essentials

Object manipulation (R)

  • Containers of information
  • Created with assignment operator <-
  • Different data structures
    • Vectors, dataframes, lists
  • Different data types
    • Character, numeric, integer, boolean
  • Access subsets with indices (starting at 1)
# Create scalar
my_text <- "Here is some text"

# Create vectors
texts <- c("Some text", 
           "Some more text", 
           "Even more"
           )
n_words <- c(2, 3, 2)

# Create dataframe
df <- data.frame(texts, n_words)

# Access second element of vector
texts[2]

# Access third row in n_words column
df$n_words[3]

Object manipulation (Python)

  • Lists are the main object type
  • Created with = operator and square brackets
  • Dataframes created with pandas, not natively
  • Different data types: “strings”, “floats”, “integers”
  • Accessed through indices (starting at 0)
# Create object
my_text = "Here is some text"

# Create lists
texts = ["Some text", "Some more text", "Even more"]
n_words = [2, 3, 2]

# Create dataframe
import pandas as pd
df = pd.DataFrame({
    "Text": texts,
    "Length": n_words
})

# Access second element of list
texts[1]

# Access third row in Length column
df["Length"].iloc[2]

File and folder manipulation (R)

  • Paths are crucial
    • Relative and absolute
  • Know your working directory with getwd()
  • Get directory contents with list.files()
  • Create files/dirs with file.create(), dir.create()
  • Move with file.rename()
  • Delete with file.remove(), unlink()
# Create a directory and some files
dir.create("test")
file.create(c("test/notes1.txt", "test/notes2.txt"))

# Get content
contents <- list.files("test")

# Get full paths of content
contents_full <- list.files("test", full.names = TRUE)

# NB: Output of list.files is vector of paths, 
# not files themselves

File and folder manipulation (Python)

  • Primarily done with the os module
  • Get dir contents with os.listdir() or glob.glob()
  • Create files with open(FILE, "w")
  • Create dirs with os.makedirs()
  • Move with shutil module and move()
  • Delete with os.remove(), os.rmdir()
import os

# Create a directory and some files
os.makedirs("test", exist_ok=True)
open("test/notes1.txt", "w").close()
open("test/notes2.txt", "w").close()

# Get content
contents = os.listdir("test")

# Get full paths
import glob
contents_full = glob.glob("test/*")

# Move file
import shutil
shutil.move("test/notes1.txt", ".")

Iteration (R)

  • Repeating a procedure on several elements; crucial to scaling
  • Two main ways: “loops” and “apply functions”
  • Loop syntax:
for (SEQUENCE_DEFINITION) {
  INSTRUCTION
}
  • Sequence definition:
    • A variable representing element to be processed in each iteration
    • The word in
    • The vector to be iterated over
  • Two approaches:
    • loop over elements directly
    • loop over indices of elements (preferable)
# Basic loop
for (i in texts) {
  print(i)
}

# Loop over indices
for (i in seq_along(texts)) {
  print(texts[i])
}

# Indices let you do more things
for (i in seq_along(texts)) {
  message("Processing text ", i, " of ", length(texts))
  new_name <- paste0("text_", i, ".txt")
  write(texts[i], new_name)
}

Iteration (Python)

  • Very similar to R
  • Indentation matters
  • Use enumerate() to get both index & value
# Basic loop
for t in texts:
    print(t)

# Loop over indices    
for i in range(len(texts)):
    print(texts[i])

# Indices let us do more things
for i, t in enumerate(texts):
    print(f"Processing text {i+1} of {len(texts)}")
    new_name = f"text_{i}.txt"
    with open(new_name, "w") as f:
        f.write(t)

Function building (R)

  • Powerful when combined with iteration
  • Basic syntax:
FUNCTION_NAME <- function() {
  INSTRUCTION
}
  • You can add parameters; these become variables in the instruction
# Basic function
say_hello <- function() {
  print("hello!")
}
say_hello()

# One parameter
greet <- function(name) {
  greeting <- paste0("Hello, ", name, "!")
  print(greeting)
}
greet("Rami")

# Two parameters
greet_n <- function(name, n_times) {
  greeting <- paste0("Hello, ", name, "!")
  rep(greeting, n_times)
}
greet_n("Nada", 5)

Function building (Python)

  • Again similar to R, but even simpler
  • Use def and colon
  • Indentation matters
# Basic function
def say_hello():
    print("hello!")

say_hello()

# One param
def greet(name):
    greeting = f"Hello, {name}!"
    print(greeting)

greet("Rami")

# Two params
def greet_n(name, n_times):
    greeting = f"Hello, {name}!"
    return [greeting] * n_times

print(greet_n("Nada", 5))

Text manipulation (R)

  • Load with readr::read_file()
  • Save with write()
  • Manipulate with stringr functions
  • NB: manipulation often involves Regex
library(lorem)
library(readr)
library(stringr)
library(tokenizers)

# Create random content
text <- as.character(ipsum(3))

# Save to file
write(text, "sample.txt")

# Load from file
same_text <- read_file("sample.txt")

# Do things with it
count_words(same_text)
smileys <- str_replace_all(same_text, "\\.", " 🙂")
write(smileys, "smileys.txt")

Text manipulation (Python)

  • The least intuitive part is the procedure for writing and reading text files:
with open(FILENAME, "w") as f:
    f.write(TEXT)
    
with open(FILENAME, "r") as f:
    TEXT = f.read()
  • Regex module re lets you substitute and find things
import lorem
import re

# Create random content
text = lorem.paragraph() * 3

# Save to file
with open("sample.txt", "w") as f:
    f.write(text)

# Load from file
with open("sample.txt", "r") as f:
    same_text = f.read()

# Do things with it
word_count = len(same_text.split())

smileys = re.sub(r"\.", " 🙂", same_text)

with open("smileys.txt", "w", encoding="utf-8") as f:
    f.write(smileys)

Image manipulation (R)

  • magick package is key; it lets us:
    • Inspect
    • Convert
    • Add effects
    • Crop
library(magick)

# Load an example jpeg
files <- list.files("example_docs/columns/orig",
  full.names = TRUE
)
img <- image_read(files[1])
image_info(img)

# Get specific page of a PDF
img2 <- image_read_pdf(files[2], pages = 1)

# Make greyscale
img2_grey <- image_convert(img2, type = "Grayscale")

# crop a 200x150 region starting at (50, 100)
img2_crop <- image_crop(img2, "200x150+50+100")

# Save (use this to convert)
image_write(img2_crop, "test.png", format = "png")

Image manipulation (Python)

  • Many ways to manipulate images in Python, but “Pillow” is the simplest
import glob
from PIL import Image

# Load an example jpeg
files = glob.glob("example_docs/columns/orig/*")
img = Image.open(files[0])

# Make greyscale
img_grey = img.convert("L")

# Crop a 200x150 region starting at (50, 100)
# PIL crop uses coordinates 
# (left, top, right, bottom)
img_crop = img.crop((50, 100, 250, 250))

# Save (use this to convert)
img_crop.save("test.png", format="PNG")

OCR evaluation (R)

  • “Levenshtein distance”: Minimum number of edits to make texts identical
  • Character error rate (CER) and word error rate (WER)
  • Alternatively, “accuracy” (1 minus error)
  • Easiest with the jiwer command-line program, accessed in R via system()
  • Syntax:
jiwer -r REFERENCE -h HYPOTHESIS
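To make the metric concrete, here is a minimal pure-Python sketch of Levenshtein distance and CER. This is only to show the idea; jiwer computes these for you, with more careful text normalization:

```python
def levenshtein(ref, hyp):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn ref into hyp (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution
                            ))
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edits divided by reference length."""
    return levenshtein(ref, hyp) / len(ref)

print(levenshtein("kitten", "sitting"))  # 3
print(cer("abcd", "abed"))               # 0.25
```

WER is the same calculation applied to lists of words instead of characters, and "accuracy" is simply 1 minus the error rate.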
# On command line
pip install jiwer


## WER
command <- "jiwer -g -r sample.txt -h smileys.txt"
wer <- as.numeric(system(command, intern = TRUE))
wer
# Inspect
system("jiwer -g -a -r sample.txt -h smileys.txt")

# CER (add -c)
command <- "jiwer -g -c -r sample.txt -h smileys.txt"
cer <- as.numeric(system(command, intern = TRUE))
cer
# Inspect
system("jiwer -g -a -c -r sample.txt -h smileys.txt")

OCR evaluation (Python)

  • Use jiwer Python package
import jiwer

# Read the files
with open("sample.txt", "r") as f:
    ref = f.read()
with open("smileys.txt", "r") as f:
    hyp = f.read()

## WER
wer = jiwer.wer(ref, hyp)
wer

# Inspect
jiwer.process_words(ref, hyp)

# CER
cer = jiwer.cer(ref, hyp)
cer

# Inspect
jiwer.process_characters(ref, hyp)

6. OCR tools and strategies

Three main challenges

  • Visual noise (speckles, stains, skew, bleedthrough)
  • Unusual script (old typefaces, handwriting)
  • Complex layouts (newspapers, magazines, tables)

Character recognition vs layout parsing

  • Two very different problems

  • Most OCR engines try to do both, but often fail on complex layouts

  • Complex layouts often require “the cutout approach”:

    • Identify text blocks
    • Cut them out programmatically
    • OCR them individually
    • Reassemble
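The cutout steps above can be sketched as pure control flow. In this toy version the "image" is just a list of text rows and the OCR engine is faked; a real pipeline would use Pillow crops (as shown later) and an engine such as Tesseract:

```python
def crop(image, box):
    """Cut a rectangular block out of the image.
    box = (left, top, right, bottom), as in Pillow's Image.crop."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]

def cutout_ocr(image, boxes, ocr_block):
    """OCR each block separately, then reassemble in reading order."""
    # Sort blocks top-to-bottom, then left-to-right
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return "\n".join(ocr_block(crop(image, b)) for b in ordered)

# Toy inputs to demonstrate the flow
page = ["AAABBB",
        "AAABBB"]
blocks = [(3, 0, 6, 2), (0, 0, 3, 2)]      # two "columns" from a layout parser
fake_engine = lambda img: "".join(img)     # stand-in for a real OCR call
print(cutout_ocr(page, blocks, fake_engine))
```

For right-to-left scripts like Arabic, the reassembly order of side-by-side blocks would of course be reversed (right column first).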

Two main strategies: 1) Straight OCR; 2) The cutout approach

Tool types

  • Regular engines
    • Local: Tesseract, Kraken, Surya, EasyOCR
    • Remote: Google Document AI
  • Visual large language models
    • Local: OLM, Qwen, any LLM with vision
    • Remote: Mistral OCR, ChatGPT, any LLM with vision
  • Layout parsers
    • Pretrained: Surya
    • Custom trained: YOLO, LayoutParser

Decision tree

  1. How complex is the layout?
    • Low -> Straight OCR
    • High -> Cutout approach
  2. Do I mind using a remote service? (for privacy or cost reasons)
    • No -> Use Google Document AI or Mistral OCR
    • Yes -> Use a local engine or VLLM

Currently, regular engines perform slightly better than VLLMs, but this may well change